ABSTRACT

Deep convolutional neural networks (CNNs) have been shown to achieve superior performance on a number of computer vision tasks such as image recognition, classification and object detection. Deep networks have also been tested for view invariance, robustness and illumination invariance. However, the CNN architecture has thus far only been tested for invariance to non-uniform illumination. Can a CNN perform equally well on very underexposed or overexposed images, that is, under uniform illumination variation? This is the gap that we address in this paper. In our work, we collected ear images under different uniform illumination conditions, with illuminance values ranging from 2 lux to 10,700 lux. A total of 1,100 left and right ear images from 55 subjects were captured under natural illumination conditions. As a CNN requires a considerably large amount of data, the ear images were further rotated in 5° steps to generate 25,300 images. For each subject, 50 images are used as the validation/testing dataset, while the remaining images are used as the training dataset. Our proposed CNN model is then trained from scratch, and the validation and testing results show a recognition accuracy of 97%. The results show that 100% accuracy is achieved for images with illuminance above 30 lux, but the model has problems with images below 10 lux.
1. INTRODUCTION

Convolutional neural networks (CNNs) have been successfully applied to many recognition problems. The conceptual architecture of the CNN is inspired by the seminal work in [1] on receptive fields in the cat's striate cortex. Later, Fukushima [2] described the Neocognitron, which defines the layer-wise structure of neural networks and explains the spatial invariance characteristic of the simple cells and complex cells of the primary visual cortex. LeCun introduced the structure of the CNN for face and digit recognition [3-5], which demonstrated better recognition results than probability density function methodologies such as Gaussian Bayesian approaches and Gaussian mixture models. They developed a multi-layer artificial neural network called LeNet-5 which could classify handwritten digits. Like other neural networks, LeNet-5 has multiple layers and can be trained with the backpropagation algorithm [6]. It can obtain effective representations of the original image, which makes it possible to recognize visual patterns directly from raw pixels with little to no preprocessing. Generally, CNNs are constructed by stacking interleaved layers of two types: convolutional layers and pooling (subsampling) layers [7]. The pooling layer plays an important role in CNNs since it is mainly responsible for the invariance to data variation and perturbation. The pooling layer is obtained through a pooling operation, which typically contains two steps. Firstly, the pooling operator scans the feature map and aggregates the information (responses) within each local region [8].

Journal homepage: http://jiaescore.com/journals/index.php/ijeecs

Indonesian J Elec Eng & Comp Sci, ISSN: 2502-4752

Deep convolutional neural networks have come to dominate computer vision, outperforming traditional machine intelligence approaches. They have excelled in image classification [9, 10], character recognition [11] and many other tasks. The robustness of CNNs has also been shown in [12], where the layers of a CNN are studied to better understand how viewpoint invariance is achieved. The datasets used are the RGB-D dataset [13] and the Pascal3D+ dataset [14], as both contain multiview tabletop objects and images of object categories exhibiting high variability, captured in uncontrolled settings and under many different poses. When pre-trained CNNs are used, the network captures representations that highly preserve the manifold structure at most of the network layers, including the fully connected layers. It is also noted that later layers such as Pool 5 show a better representation of the view manifold than early layers such as Pool 1. When the network is fine-tuned, very good pose estimation performance is achieved. In [15], the tolerance of CNNs to image variations such as translation, scale, pose and illumination is investigated using a large-scale synthetic dataset. The dataset consists of images of objects from 16 categories, with 8 rotation angles, 11 cameras on a semicircular arch for random viewpoints and distances, 5 lighting conditions, 3 focus levels and a variety of backgrounds, generating over 20 million images in total. A pre-trained network, namely AlexNet, is used to explore the expressive capabilities of the pool5 and fc7 layers for object and parameter prediction. The results showed that both fc7 and pool5 produced accurate classification results, implying that these layers have good discriminative power for object recognition. As for parameter prediction, the accuracies for lighting, rotation and camera are 100%, 77% and 62%, respectively. This implies that the layers have the most difficulty in predicting the camera view, while lighting variations pose very little prediction difficulty. As changing the camera view produces geometric variants, the prediction task becomes more complicated. This finding is further supported by an assessment of how well the CNN transfers a parameter learned on one object category to another. The lighting parameter is easily transferred from seen objects to unseen objects, while variations in camera view degrade the performance of the network for unseen objects. As lighting variations do not transform the geometric shape of the seen object, lighting is the simplest knowledge to transfer to unseen categories.

Even though viewpoint or camera-view invariance remains a challenge in object recognition [17], we consider only illumination invariance in this paper. Although CNNs have been shown to handle illumination variation in object recognition, we believe that further study is needed for images with uniform illumination variation, as most work focuses only on non-uniform illumination [18]. Figures 1 and 2 illustrate previous work [15, 16] on illumination invariance, where the image datasets contain only non-uniform illumination variation caused by changes in lighting intensity, direction and spectrum.
Figure 1. A boat under 5 different illuminations [15]
On the other hand, Vonikakis et al. [18] constructed a new benchmark dataset named Phos, featuring a greater variety of imaging conditions than existing databases and containing images captured under both uniform and non-uniform illumination. Phos comprises 15 scenes captured under different illumination conditions. Each scene contains 9 images taken under different uniform illuminations and 6 images under different degrees of non-uniform illumination. Figure 3 shows an example of a scene in the Phos database. Uniform illumination is attained by using several diffusive light sources evenly distributed around the objects. Using the shutter speed, four overexposed (+) and four underexposed (-) images are generated from the correctly exposed image. A strong directional light source with different strengths is projected onto the objects, producing the 6 non-uniform illumination images.
Figure 3. RGB images taken at the same location every hour of a 24-hour period [16].


Even though the Phos image dataset is valuable and freely accessible for public use, its uniform illumination is produced in a controlled environment. Under- and over-exposed pixels in RGB images are the major source of artefacts in illumination-invariant images, particularly during daylight hours [16]. Therefore, our first contribution in this paper is the collection of uniform illumination images under natural illumination sources. Our uniform illumination images are ear images captured in uncontrolled indoor and outdoor environments. We chose the ear biometrics domain because ear recognition using deep CNNs is still scarce owing to the lack of large-scale datasets [19]. At the time of writing, only four published works [19]-[22] employing deep learning for ear recognition were found, two of them from the same team. Although they also used CNN architectures, the datasets used were captured under controlled environments or consist of ear images with non-uniform illumination captured in the wild. Consequently, this paper investigates the accuracy of a CNN for ear biometrics captured under uniform illumination variation.


2. THE UNIFORM ILLUMINATION-INVARIANT EAR IMAGE DATASET

Illumination is the amount of light present in an area [23] and is measured in lux (lx), that is, lumens (lm) per square meter. Lux is the unit of illuminance, which is measured using a lux meter. The uniqueness of our dataset is that the ear images are acquired in indoor and outdoor environments, and the illumination is objectively quantified using a lux meter at the actual locations. The acquisition process started with the subject seated on a provided chair. The subject's ear faces the camera, and the camera is set at the same level as the ear. Then, the illumination is measured using the lux meter. The lux meter sensor is placed directly under the ear to ensure an accurate reading. Details of the image acquisition can be found in our previous work [24]. For the uniform illumination variation, we managed to collect ear images with illumination ranging from 2 lux to 10,700 lux from 55 subjects. Ten images of each ear are collected from each subject, with various lux values depending on the surrounding lighting. The final compilation of the ear dataset consists of a total of 1,100 images comprising 550 left and 550 right ear images. Table 1 lists the number of left and right ear images in each lux range, while Figure 4 shows some examples of the ear images at selected lux values. The resolution of the ear images is 2048 × 1056 pixels.


Table 1. Lux variations of the ear image dataset

Lux       Left  Right  Total    Lux         Left  Right  Total    Lux           Left  Right  Total
1-10       130    106    236    101-200       52     50    102    2001-3000       16     14     30
11-20       54     80    134    201-300       34     32     66    3001-4000        4      6     10
21-30       32     28     60    301-400       48     50     98    4001-5000        8      8     16
31-40        6      6     12    401-500        2      8     10    5001-6000        6      6     12
41-50       30     36     66    501-600        4      6     10    6001-7000        0      2      2
51-60       14      8     22    601-700        6      4     10    7001-8000        6      2      8
61-70        8      4     12    701-800        4      4      8    8001-9000        2      2      4
71-80       10     18     28    801-900        6      4     10    9001-10000       0      0      0
81-90       22     16     38    901-1000       2      6      8    10001-11000      0      2      2
91-100      12     14     26    1001-2000     32     28     60    Total          550    550   1100





[Figure 4 panels: ear images at 5 lux, 18 lux, 33 lux, 85 lux, 960 lux, 1180 lux, 2180 lux and 5110 lux]

Figure 4. Our uniform illumination-invariant ear image dataset


Data augmentation by rotation in 5° steps is applied to the ear images to generate adequate training and testing data for the CNN model. After data augmentation, a total of 25,300 images are generated. For each subject, approximately 89.1% of the images (410 images per subject) are used as the training dataset and the remaining 10.9% (50 images per subject) are used as the validation and testing dataset.
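
The dataset bookkeeping in this section can be checked with a short calculation. The sketch below uses only the figures quoted in the text; the symmetric range of -55° to +55° used to enumerate the rotation angles is purely an illustrative assumption, chosen because it yields the 23 copies per image implied by 25,300 / 1,100, and is not a statement of the angles actually used.

```python
# Bookkeeping for the augmented ear dataset (figures taken from the text).
SUBJECTS = 55
ORIGINAL_IMAGES = 1100      # 550 left + 550 right ear images
AUGMENTED_IMAGES = 25300    # after rotation in 5-degree steps
TEST_PER_SUBJECT = 50

# Assumption: one plausible set of 5-degree rotations that gives 23 copies.
angles = list(range(-55, 60, 5))

copies_per_image = AUGMENTED_IMAGES // ORIGINAL_IMAGES
per_subject = AUGMENTED_IMAGES // SUBJECTS
train_per_subject = per_subject - TEST_PER_SUBJECT

train_share = 100 * train_per_subject / per_subject
test_share = 100 * TEST_PER_SUBJECT / per_subject

print(copies_per_image, len(angles))                 # 23 23
print(per_subject, train_per_subject)                # 460 410
print(round(train_share, 1), round(test_share, 1))   # 89.1 10.9
```

With 460 augmented images per subject, holding out 50 leaves 410 for training, which gives 2,750 held-out images across the 55 subjects.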


3. PROPOSED CNN MODEL 

In this paper, we present the use of a deep convolutional neural network (CNN) to identify a person based on ear biometrics. On the basis of an analysis of the structure and parameters of the CNN, the gradient-descent algorithm can be applied to train it. The 25,300 ear images are resized to 125 × 125 pixels to match the size of the input layer, and the resized images are then used to train the CNN model. This CNN model can improve the convergence speed while training the parameters, and obtains higher recognition accuracy than a conventional neural network or support vector machine. An advantage of the proposed CNN is that images can be input directly to the model. Figure 5 shows the proposed CNN model, and its layers are described in the following subsections.
3.1. Convolutional Layers

First, an image input layer feeds the images to the network and applies data normalization. Then, convolution and pooling are applied so that the images can be classified from a reduced representation of the data [25]. The details of the first convolutional layer are presented in Table 2.


[Figure 5 diagram: 125x125x1 input image, three convolution stages with batch normalization (labels include 3x3x16 and 16x16x64 with padding 1), fully connected layer, softmax and cross-entropy output over subjects up to Subject 55]

Figure 5. Our proposed CNN model


Table 2. First convolutional layer of our proposed CNN

Layer Type                                Layer Detail
Input image                               (125, 125, 1)
Filters                                   16
Filter size                               3 × 3
Number of weights per filter              3*3*1 = 9
Number of parameters in the layer         (9+1)*16 = 160
Stride                                    1
Number of neurons in each feature map     (125-3+2*1)/1+1 = 125
Total number of neurons in the layer      125*125*16 = 250,000
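
The entries of Table 2 follow from the usual convolution arithmetic, which can be sketched as below. The helper names `conv_output_size` and `conv_parameter_count` are hypothetical and introduced only for illustration; the numeric inputs are the ones listed in Table 2.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial size of a convolution output: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1


def conv_parameter_count(filter_size, in_channels, num_filters):
    """Weights per filter plus one bias, times the number of filters."""
    weights_per_filter = filter_size * filter_size * in_channels
    return (weights_per_filter + 1) * num_filters


# First convolutional layer: 125 x 125 x 1 input, 16 filters of 3 x 3,
# padding 1, stride 1 (values from Table 2).
side = conv_output_size(125, 3, padding=1, stride=1)
params = conv_parameter_count(3, in_channels=1, num_filters=16)
neurons = side * side * 16

print(side, params, neurons)  # 125 160 250000
```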


Secondly, a batch normalization layer is added after the convolutional layer to speed up the training of the network and reduce its sensitivity to network initialization. Batch normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for dropout [26]. Thirdly, the rectified linear unit (ReLU) is used as the activation function. The ReLU layer performs a threshold operation on each element of its input, as given in Equation 1 [27], where x is the input element and f(x) is the output. If the element x is less than zero, it is set to 0; otherwise, it remains as is.


\[
f(x) =
\begin{cases}
x, & x \ge 0 \\
0, & x < 0
\end{cases}
\tag{1}
\]
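
As a minimal sketch of the two operations just described, the snippet below normalizes a small batch of values and then applies the ReLU threshold. It is written in plain Python for clarity and makes simplifying assumptions: the learnable scale and shift of batch normalization are omitted (equivalent to a scale of 1 and a shift of 0), and the sample values are invented for illustration.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def relu(values):
    """Threshold operation: negative elements become 0, others pass through."""
    return [v if v >= 0 else 0.0 for v in values]


# Illustrative activations (assumed values, not taken from the model).
activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
print(relu(activations))  # [0.0, 0.0, 0.0, 1.5, 3.0]
print(relu(batch_normalize(activations)))
```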


Fourthly, a max pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the maximum of each region. We configured this layer to use a 4 × 4 pooling region and a stride of 2. This structure is repeated 3 times to form 3 folds of layers, with a max pooling layer placed between each two consecutive folds, as illustrated in Figure 5.
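
The max pooling operation can be illustrated with a small sketch. It assumes no padding, so pooling regions that would extend past the border are simply dropped, and it uses a toy 6 × 6 feature map rather than a real activation map; the function name `max_pool_2d` is hypothetical.

```python
def max_pool_2d(feature_map, pool_size, stride):
    """Down-sample a 2-D list by taking the maximum of each pooling region."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for left in range(0, cols - pool_size + 1, stride):
            region = [
                feature_map[r][c]
                for r in range(top, top + pool_size)
                for c in range(left, left + pool_size)
            ]
            pooled_row.append(max(region))
        pooled.append(pooled_row)
    return pooled


# Toy 6 x 6 feature map whose entries are 0..35 in row-major order.
toy_map = [[6 * r + c for c in range(6)] for r in range(6)]
print(max_pool_2d(toy_map, pool_size=4, stride=2))  # [[21, 23], [33, 35]]
```

Because the stride (2) is smaller than the region size (4), neighbouring pooling regions overlap.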

Table 3 shows the details of the second convolutional layer. It consists of 32 filters of size 6 × 6 with a padding of 1. This convolutional layer is followed by a pooling layer with a 2 × 2 pooling region and a stride of 2. The third convolutional layer contains 64 filters, each of size 16 × 16, as can be seen in Table 4.
Table 3. Second convolutional layer of our proposed CNN

Layer Type                                Layer Details
Filters                                   32
Filter size                               6 × 6
Number of weights per filter              6*6*1 = 36
Number of parameters in the layer         (36+1)*32 = 1,184
Stride                                    1
Number of neurons in each feature map     (125-32+2*1)/1+1 = 96
Total number of neurons in the layer      96*96*32 = 294,912


Table 4. Third convolutional layer of our proposed CNN

Layer Type                                Layer Details
Filters                                   64
Filter size                               16 × 16
Number of weights per filter              16*16*1 = 256
Number of parameters in the layer         (256+1)*64 = 16,448
Stride                                    1
Number of neurons in each feature map     (125-64+2*1)/1+1 = 64
Total number of neurons in the layer      64*64*64 = 262,144


3.2. Fully-Connected Layer 

In a CNN, the convolutional and pooling layers are followed by one or more fully connected layers that produce the decision in the output layer. The fully connected layer can be considered as the hidden layer of an artificial neural network. All the neurons in the fully connected layer are connected to the neurons of both the previous and the next layers. To reduce overfitting in the fully connected layer, a regularization method called “dropout” [28] is usually employed. Dropout greatly reduces complex co-adaptations of neurons. The network is therefore forced to learn more robust features of an image, and overfitting is substantially reduced. The output of the last fully connected layer is fed to a softmax layer [29]. This layer has 55 outputs, which form a probability distribution calculated using Equation 2. The probabilities are used to label each class, that is, the recognized person. Hence, a probability vector of size 1 × 55 is obtained, where each element corresponds to one class of the dataset.


\[
p_j = \frac{e^{x_j}}{\sum_{i=1}^{k} e^{x_i}}, \quad j = 1, \ldots, k
\tag{2}
\]

where x is the net input. Output values of p are between 0 and 1 and their sum equals 1.
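
A minimal sketch of the softmax computation for the 55-class output is given below. The net inputs are invented for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that leaves the result unchanged; it is an addition of this sketch rather than something stated in the text.

```python
import math


def softmax(net_inputs):
    """Normalised exponentials: p_j = exp(x_j) / sum_i exp(x_i)."""
    peak = max(net_inputs)  # subtracted for numerical stability only
    exps = [math.exp(x - peak) for x in net_inputs]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative net inputs for 55 subjects (assumed values): index 13 stands out.
scores = [0.0] * 55
scores[13] = 4.0

probs = softmax(scores)
predicted = probs.index(max(probs))
print(predicted, round(sum(probs), 6))  # 13 1.0
```

The predicted subject is the index of the largest probability in the 1 × 55 vector.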


3.3. Classification Layer 
This layer computes the error (loss) using the cross-entropy function for the 1-of-k coding scheme, as in Equation 3.

\[
E(\theta) = -\sum_{i=1}^{n} \sum_{j=1}^{k} t_{ij} \ln y_j(x_i, \theta)
\tag{3}
\]

where θ is the parameter vector, n is the number of samples, k is the number of classes, t_ij is the indicator that the ith sample belongs to the jth class, and y_j(x_i, θ) is the output for sample i. The output y_j(x_i, θ) can be interpreted as the probability that the network associates the ith input with class j, that is, P(t_j = 1 | x_i). The output unit activation function is the softmax function.
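
The cross-entropy loss for 1-of-k targets can be sketched as follows. The two three-class samples are invented for illustration (the model itself has 55 classes), and the function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(targets, outputs):
    """E = -sum_i sum_j t_ij * ln(y_ij) for one-hot (1-of-k) targets."""
    loss = 0.0
    for t_row, y_row in zip(targets, outputs):
        for t, y in zip(t_row, y_row):
            if t:  # only the true class contributes under 1-of-k coding
                loss -= t * math.log(y)
    return loss


# Two illustrative samples with three classes each (assumed values).
targets = [[1, 0, 0], [0, 1, 0]]
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(round(cross_entropy(targets, outputs), 4))  # 0.5798
```

The loss falls towards 0 as the probability assigned to the true class approaches 1.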


4. RESULTS AND DISCUSSIONS 

The proposed CNN model is trained using mini-batches of 60 images, with the learning rate fixed at 0.001 to achieve reliable training. We used stochastic gradient descent with momentum to train the CNN. The momentum value is 0.9, following [29]-[31]. The maximum number of epochs is set to 50 to ensure network convergence. The training and validation outcomes are presented in Figure 6. Since the CNN model is trained from scratch, the recognition accuracy starts at a value close to 0 and gradually improves with the number of iterations. After iteration 5,000, the accuracy stabilizes at close to 100%. Training is therefore stopped at a validation accuracy of 99.76%, as further training is deemed unnecessary. The loss is illustrated in Figure 7. At the early stage of training, the loss is rather high, at a value close to 60. As training proceeds, the loss gradually decreases, eventually reaching a value close to 0.
Figure 7. Loss rate of the proposed CNN model
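
The optimizer settings described in this section (stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.001) can be sketched as a single update rule. The classical momentum form is assumed here, and the quadratic toy objective is hypothetical; it stands in for the network loss only to show that the update converges.

```python
def sgdm_step(weights, gradients, velocity, learning_rate=0.001, momentum=0.9):
    """One SGD-with-momentum update; returns the new weights and velocity."""
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem (assumption): minimise f(w) = w^2, whose gradient is 2w.
weights, velocity = [1.0], [0.0]
for _ in range(2000):
    gradients = [2.0 * w for w in weights]
    weights, velocity = sgdm_step(weights, gradients, velocity)

print(abs(weights[0]) < 1e-3)  # True
```

The velocity term accumulates past gradients, so each step moves further along a consistent descent direction than plain gradient descent would at the same learning rate.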
Figure 8. Wrongly recognized subjects with their corresponding luminance values
For testing, a total of 2,750 ear images are used for recognition, and the results show that 2,742 images are correctly recognized. Figure 8 illustrates the 8 wrongly recognized images with their corresponding lux values. As can be seen, with the exception of one image, all the wrongly recognized images have lux values of less than 10. In total, there are 236 ear images (refer to Table 1) with lux values of less than 10, and our proposed CNN model is able to recognize 229 of them (97%) correctly. Even though the ear in these images is not even visible to the naked eye, the proposed model correctly recognizes the majority of the subjects. As for subject #38, the augmented left ear of subject #14 is wrongly recognized even though an adequate illuminance of 22 lux is recorded for the image. This is probably due to the rotation operation performed on the image.
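
The percentages implied by the counts reported above can be reproduced with a short calculation; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(correct, total):
    """Recognition accuracy as a percentage."""
    return 100.0 * correct / total


overall = accuracy_percent(2742, 2750)   # all test images
low_light = accuracy_percent(229, 236)   # images in the lowest lux range
print(round(overall, 2), round(low_light, 2))  # 99.71 97.03
```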


5. CONCLUSION 

Deep learning using convolutional neural networks (CNNs) has been shown to produce exceptional performance over traditional methods in many application domains [32]. In this paper, a CNN model is built and trained for ear biometrics under various uniform illuminations measured in lux. As far as we know, this is the first work to test the performance of a CNN on very underexposed or overexposed images. Our results show that for uniformly illuminated images above 25 lux, 100% recognition accuracy is achieved. Even though the CNN model has problems recognizing a few images below 10 lux, the overall accuracy of 97% suggests that the CNN architecture performs equally well on images with uniform illumination variation. However, data augmentation for our dataset involves only rotations. More robust operations should be included in the data augmentation to ultimately test the performance of the CNN model for ear biometrics.